
[CELEBORN-1757] Add retry when sending RPC to LifecycleManager #3008

Open · wants to merge 24 commits into main
Conversation

@zaynt4606 (Contributor) commented Dec 19, 2024

What changes were proposed in this pull request?

Retry sending RPCs to the LifecycleManager on TimeoutException.

Why are the changes needed?

RPC messages are processed by the Dispatcher's thread pool, whose numThreads depends on numUsableCores.
In some environments (e.g. Kubernetes) the LifecycleManager's numThreads is not enough for the volume of incoming RPCs, so TimeoutExceptions occur.
This PR adds a retry when a TimeoutException is thrown.
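
As a minimal Scala sketch of the idea (the helper name, parameters, and backoff are illustrative assumptions, not the PR's exact code):

import java.util.concurrent.TimeoutException

// Hypothetical helper: retry `op` while it fails with a TimeoutException,
// sleeping between attempts so the overloaded dispatcher can drain its queue.
def callWithTimeoutRetry[T](retries: Int, retryWaitMs: Long)(op: => T): T =
  try op
  catch {
    case _: TimeoutException if retries > 0 =>
      Thread.sleep(retryWaitMs)
      callWithTimeoutRetry(retries - 1, retryWaitMs)(op)
  }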

Does this PR introduce any user-facing change?

No.

Another option is to adjust the configuration celeborn.lifecycleManager.rpc.dispatcher.threads to increase numThreads; that approach is more effective.

How was this patch tested?

Cluster testing.

@@ -29,7 +29,7 @@ license: |
| celeborn.<module>.io.enableVerboseMetrics | false | false | Whether to track Netty memory detailed metrics. If true, the detailed metrics of Netty PoolByteBufAllocator will be gotten, otherwise only general memory usage will be tracked. | | |
| celeborn.&lt;module&gt;.io.lazyFD | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. If setting <module> to `fetch`, it works for worker fetch server. | | |
| celeborn.&lt;module&gt;.io.maxRetries | 3 | false | Max number of times we will try IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `replicate`, it works for replicate client of worker replicating data to peer worker. If setting <module> to `push`, it works for Flink shuffle client push data. | | |
-| celeborn.&lt;module&gt;.io.mode | EPOLL | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
+| celeborn.&lt;module&gt;.io.mode | NIO | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
Member:

cc @SteNicholas. It seems the doc generation depends on the developer environment.

Member:

It cannot pass the GA, so this needs to be reverted.


codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 35.55556% with 29 lines in your changes missing coverage. Please review.

Project coverage is 32.55%. Comparing base (4aabe37) to head (09a7cdb).
Report is 59 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...rg/apache/celeborn/common/rpc/RpcEndpointRef.scala       6.25%   15 Missing ⚠️
.../scala/org/apache/celeborn/common/rpc/RpcEnv.scala       0.00%   13 Missing ⚠️
...cala/org/apache/celeborn/common/CelebornConf.scala      93.75%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   32.88%   32.55%   -0.33%     
==========================================
  Files         331      336       +5     
  Lines       19800    20102     +302     
  Branches     1780     1800      +20     
==========================================
+ Hits         6510     6542      +32     
- Misses      12929    13195     +266     
- Partials      361      365       +4     


initDataClientFactoryIfNeeded();
}

public <T> T callLifecycleManagerWithTimeoutRetry(Callable<T> callable, String name)
Contributor:

Instead of making changes everywhere, do we want to simply change askSync/askAsync to become retry-aware, with the number of retries passed in as a param (for specific cases where we don't want retries, for example)?

@zaynt4606 (Contributor, Author) replied Dec 23, 2024:

I agree with changing askSync/askAsync.
That change touches a lot of exception handling, because setupLifecycleManagerRef will now throw RpcTimeoutExceptions that we need to catch; I changed the exception type to RuntimeException.

@turboFei (Member) left a comment:

Thanks for the PR. I wonder whether we can introduce two config items:

celeborn.rpc.retryWait for the default retry wait.

celeborn.client.rpc.retryWait for the client-specific one.

cc @pan3793

@@ -4884,6 +4885,14 @@ object CelebornConf extends Logging {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("3s")

val RPC_TIMEOUT_RETRY_WAIT: ConfigEntry[Long] =
buildConf("celeborn.rpc.retryWait")
Member:

val RPC_RETRY_WAIT

And you can move this config to the celeborn.rpc part.
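
For reference, a sketch of how the completed entry could look after the rename, following the shape of the neighboring entries in CelebornConf (the category, version, doc text, and default value here are assumptions):

val RPC_RETRY_WAIT: ConfigEntry[Long] =
  buildConf("celeborn.rpc.retryWait")
    .categories("network")
    .version("0.6.0")
    .doc("Wait time before retrying an RPC request that failed with a timeout.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")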

Member:

I wonder whether you can introduce a new config, celeborn.client.rpc.retryWait, for the client end.
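
A possible shape for that client-side entry is a fallback to the generic one, so it inherits celeborn.rpc.retryWait unless explicitly set (a sketch; the fallbackConf usage here is an assumption based on similar entries in CelebornConf):

val CLIENT_RPC_RETRY_WAIT: ConfigEntry[Long] =
  buildConf("celeborn.client.rpc.retryWait")
    .categories("client")
    .version("0.6.0")
    .doc("Client-side wait time before retrying a timed-out RPC.")
    .fallbackConf(RPC_RETRY_WAIT)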

@@ -30,6 +33,7 @@ abstract class RpcEndpointRef(conf: CelebornConf)
extends Serializable with Logging {

private[this] val defaultAskTimeout = conf.rpcAskTimeout
private[celeborn] val waitTimeBound = conf.rpcTimeoutRetryWaitMs.toInt
Member:

private[this] val defaultRetryWait

@@ -104,6 +106,7 @@ object RpcEnv {
abstract class RpcEnv(config: RpcEnvConfig) {

private[celeborn] val defaultLookupTimeout = config.conf.rpcLookupTimeout
private[celeborn] val waitTimeBound = config.conf.rpcTimeoutRetryWaitMs.toInt
Member:

private[celeborn] val defaultRetryWait

* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
def askSync[T: ClassTag](message: Any, timeout: RpcTimeout, retryCount: Int): T = {
Member:

def askSync[T: ClassTag](message: Any, timeout: RpcTimeout, retryCount: Int, retryWait: Long)
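
A sketch of how that retry-aware overload on RpcEndpointRef could wrap the existing single-attempt askSync (the loop shape and sleep-based backoff are assumptions, not necessarily the PR's final code; RpcTimeoutException is the type the author mentions catching above):

def askSync[T: ClassTag](
    message: Any,
    timeout: RpcTimeout,
    retryCount: Int,
    retryWait: Long): T = {
  var attempt = 0
  while (true) {
    try {
      return askSync[T](message, timeout) // existing two-argument overload
    } catch {
      case e: RpcTimeoutException =>
        if (attempt >= retryCount) throw e
        attempt += 1
        Thread.sleep(retryWait)
    }
  }
  throw new IllegalStateException("unreachable")
}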

def setupEndpointRef(
address: RpcAddress,
endpointName: String,
retryCount: Int): RpcEndpointRef = {
Member:

def setupEndpointRef(
      address: RpcAddress,
      endpointName: String,
      retryCount: Int,
      retryWait: Long)
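
The lookup path could follow the same assumed retry loop, delegating to the existing two-argument setupEndpointRef (sketch only):

def setupEndpointRef(
    address: RpcAddress,
    endpointName: String,
    retryCount: Int,
    retryWait: Long): RpcEndpointRef = {
  var attempt = 0
  while (true) {
    try {
      return setupEndpointRef(address, endpointName) // existing lookup
    } catch {
      case e: RpcTimeoutException =>
        if (attempt >= retryCount) throw e
        attempt += 1
        Thread.sleep(retryWait)
    }
  }
  throw new IllegalStateException("unreachable")
}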

@zaynt4606 (Contributor, Author):

Sorry for the delay.
This PR has been updated with the two configurations CLIENT_RPC_RETRY_WAIT and RPC_RETRY_WAIT.
